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A Unified Successive Pseudo-Convex 
Approximation Framework 

Yang Yang and Marius Pesavento 


Abstract —In this paper, we propose a successive pseudo-convex 
approximation algorithm to efficiently compute stationary points 
for a large class of possibly nonconvex optimization problems. 
The stationary points are obtained by solving a sequence of 
successively refined approximate problems, each of which is 
much easier to solve than the original problem. To achieve 
convergence, the approximate problem only needs to exhibit a 
weak form of convexity, namely, pseudo-convexity. We show that 
the proposed framework not only includes as special cases a 
number of existing methods, for example, the gradient method 
and the Jacobi algorithm, but also leads to new algorithms which 
enjoy easier implementation and faster convergence speed. We 
also propose a novel line search method for nondifferentiable 
optimization problems, which is carried out over a properly 
constructed differentiable function with the benefit of a simpli¬ 
fied implementation as compared to state-of-the-art line search 
techniques that directly operate on the original nondifferentiable 
objective function. The advantages of the proposed algorithm 
are shown, both theoretically and numerically, by several ex¬ 
ample applications, namely, MIMO broadcast channel capacity 
computation, energy efficiency maximization in massive MIMO 
systems and LASSO in sparse signal recovery. 

Index Terms —Energy efficiency, exact line search, LASSO, 
massive MIMO, MIMO broadcast channel, nonconvex optimiza¬ 
tion, nondifferentiable optimization, successive convex approxi¬ 
mation. 


1. Introduction 


In this paper, we propose an iterative algorithm to solve the 
following general optimization problem: 


minimize /(x) 

X 

subject to X G Y, 


( 1 ) 


where Y C is a closed and convex set, and /(x) : IZ^ 

7^ is a proper and differentiable function with a continuous 
gradient. We assume that problem Q has a solution. 

Problem Q also includes some class of nondifferentiable 
optimization problems, if the nondifferentiable function ^(x) 
is convex: 


because problem ^ can be rewritten into a problem with the 
form of 0 by the help of auxiliary variables: 


minimize /(x) + y 
subject to X G Y, g{x.) < y. 


( 3 ) 


We do not assume that /(x) is convex, so 0 is in general a 
nonconvex optimization problem. The focus of this paper is on 
the development of efficient iterative algorithms for computing 
the stationary points of problem Q. The optimization problem 
0 represents general class of optimization problems with a 
vast number of diverse applications. Consider for example 
the sum-rate maximization in the MIMO multiple access 
channel (MAC) Q, the broadcast channel (BC) Eia and the 
interference channel (IC) El S 01711191, where /(x) is the 
sum-rate function of multiple users (to be maximized) while 
the set Y characterizes the users’ power constraints. In the 
context of the MIMO IC, 0 is a nonconvex problem and NP- 
hard Q. As another example, consider portfolio optimization 
in which /(x) represents the expected return of the portfolio 
(to be maximized) and the set Y characterizes the trading 
constraints Gol. Furthermore, in sparse (/i-regularized) linear 
regression, /(x) denotes the least square function and g{x.) is 
the sparsity regularization function ifTTl fT^ . 

Commonly used iterative algorithms belong to the class of 
descent direction methods such as the conditional gradient 
method and the gradient projection method for the differen¬ 
tiable problem (0 ifTSll and the proximal gradient method for 
the nondifferentiable problem ^ [MKTsI, which often suffer 
from slow convergence. To speed up the convergence, the 
block coordinate descent (BCD) method that uses the notion of 
the nonlinear best-response has been widely studied M Sec. 
2.7]. In particular, this method is applicable if the constraint 
set of 0 has a Cartesian product structure Y = x ... x Xk 
such that 


minimize /(xi,..., x^) 

X=(Xfc)f^^ ^4^ 

subject to x/c G Y/c, /c = 1,..., 


minimize /(x) -1- g{x.) 

X 

subject to X e A", 


( 2 ) 
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The BCD method is an iterative algorithm: in each iteration, 
only one variable is updated by its best-response x^^^ = 
argminx.eA’fe /(xL\ • • ■ ■ ■ ■ ,^k) O-e- the 

point that minimizes /(x) with respect to (w.r.t.) the variable 
x/j; only while the remaining variables are fixed to their values 
of the preceding iteration) and the variables are updated se¬ 
quentially. This method and its variants have been successfully 
adopted to many practical problems |[Il|6l|71[l0l|l6l. 

When the number of variables is large, the convergence 
speed of the BCD method may be slow due to the sequential 
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nature of the update. A parallel variable update based on the 
best-response seems attractive as a mean to speed up the 
updating procedure, however, the convergence of a parallel 
best-response algorithm is only guaranteed under rather re¬ 
strictive conditions, c.f. the diagonal dominance condition on 
the objective function /(xi,... ,xx) (121 . which is not only 
difficult to satisfy but also hard to verify. If /(xi,... ,xx) 
is convex, the parallel algorithms converge if the stepsize 
is inversely proportional to the number of block variables 
K. This choice of stepsize, however, tends to be overly 
conservative in systems with a large number of block variables 
and inevitably slows down the convergence Enniiisi. 

A recent progress in parallel algorithms has been made in 
(8l|3[ia|20l, in which it was shown that the stationary point 
of Q can be found by solving a sequence of successively 
refined approximate problems of the original problem (0, and 
convergence to a stationary point is established if, among other 
conditions, the approximate function (the objective function 
of the approximate problem) and stepsizes are properly se¬ 
lected. The parallel algorithms proposed in Eiiaiiiiaoi are 
essentially descent direction methods. A description on how 
to construct the approximate problem such that the convexity 
of the original problem is preserved as much as possible is 
also contained in Elia [muni to achieve faster convergence 
than standard descent directions methods such as classical 
conditional gradient method and gradient projection method. 

Despite its novelty, the parallel algorithms proposed in 
EllllliaEol suffer from two limitations. Firstly, the approx¬ 
imate function must be strongly convex, and this is usually 
guaranteed by artificially adding a quadratic regularization 
term to the original objective function /(x), which how¬ 
ever may destroy the desirable characteristic structure of the 
original problem that could otherwise be exploited, e.g., to 
obtain computationally efficient closed-form solutions of the 
approximate problems i). Secondly, the algorithms require 
the use of a decreasing stepsize. On the one hand, a slow 
decay of the stepsize is preferable to make notable progress 
and to achieve satisfactory convergence speed; on the other 
hand, theoretical convergence is guaranteed only when the 
stepsize decays fast enough. In practice, it is a difficult task 
on its own to find a decay rate for the stepsize that provides 
a good trade-off between convergence speed and convergence 
guarantee, and current practices mainly rely on heuristics {]M . 

The contribution of this paper consists in the development 
of a novel iterative convex approximation method to solve 
problem Q. In particular, the advantages of the proposed 
iterative algorithm are the following: 

1) The approximate function of the original problem Q in 
each iteration only needs to exhibit a weak form of convexity, 
namely, pseudo-convexity. The proposed iterative method not 
only includes as special cases many existing methods, for 
example, (H |8l [9j [191, but also opens new possibilities for 
constructing approximate problems that are easier to solve. For 
example, in the MIMO BC sum-rate maximization problems 
(Sec. |IV-A| ), the new approximate problems can be solved 
in closed-form. We also show by a counterexample that the 
assumption on pseudo-convexity is tight in the sense that if it 
is not satisfied, the algorithm may not converge. 


2) The stepsizes can be determined based on the problem 
structure, typically resulting in faster convergence than in cases 
where constant stepsizes EKMlIll and decreasing stepsizes 
m [191 are used. For example, a constant stepsize can be used 
when /(x) is given as the difference of two convex functions 
as in DC programming ED. When the objective function 
is nondifferentiable, we propose a new exact/successive line 
search method that is carried out over a properly constructed 
differentiable function. Thus it is much easier to implement 
than state-of-the-art techniques that operate on the original 
nondifferentiable objective function directly. 


In the proposed algorithm, the exact/successive line search 
is used to determine the stepsize and it can be implemented in 
a centralized controller, whose existence presence is justified 
for particular applications, e.g., the base station in the MIMO 
BC, and the portfolio manager in multi-portfolio optimization 
(H. We remark that also in applications in which centralized 
controller are not admitted, however, the line search procedure 
does not necessarily imply an increased signaling burden when 
it is implemented in a distributed manner among different 
distributed processors. For example, in the LASSO problem 
studied in Sec. |IV-C[ the stepsize based on the exact line 
search can be computed in closed-form and it does not incur 
any additional signaling as in predetermined stepsizes, e.g., 
decreasing stepsizes and constant stepsizes. Besides, even 
in cases where the line search procedure induces additional 
signaling, the burden is often fully amortized by the significant 
increase in the convergence rate. 


The rest of the paper is organized as follows. In Sec. 
[I^ we introduce the mathematical background. The novel 
iterative method is proposed and its convergence is analyzed 
in Sec. m its connection to several existing descent direction 
algorithms is presented there. In Sec. IlYl several applications 
are considered: the sum rate maximization problem of MIMO 
BC, the energy efficiency maximization of a massive MIMO 
system to illustrate the advantage of the proposed approximate 
function, and the LASSO problem to illustrate the advantage 
of the proposed stepsize. The paper is finally concluded in 
Sec.lV] 


Notation: We use x, x and X to denote a scalar, vector 
and matrix, respectively. We use Xjk to denote the (j, k)- 
th element of X; Xk is the k-th element of x where x = 
{xk)^^i, and x_/j; denotes all elements of x except Xk'^-k = 
denote x“^ as the element-wise inverse of 
X, i.e., (x“^)/j; = l/xk- Notation x o y and X (g) Y denotes 
the Hadamard product between x and y, and the Kronecker 
product between X and Y, respectively. The operator [x]^ 
returns the element-wise projection of x onto [a, b]: [x]^ = 
max(min(x, b), a), and [x]^ = [x]q. We denote [x] as the 
smallest integer that is larger than or equal to x. We denote 
d(X) as the vector that consists of the diagonal elements of X 
and diag(x) is a diagonal matrix whose diagonal elements are 
as same as x. We use 1 to denote the vector whose elements 
are equal to 1. 


IT Preliminaries on Descent Direction Method 
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AND Convex Functions 

In this section, we introduce the basic definitions and 
concepts that are fundamental in the development of the 
mathematical formalism used in the rest of the paper. 

Stationary point. A point y G A' is a stationary point of 

0 if 

(x-y)^V/(y) > 0, Vx G A". (5) 

Condition 0 is the necessary condition for local optimality 
of the variable y. For nonconvex problems, where global 
optimality conditions are difficult to establish, the computation 
of stationary points of the optimization problem 0 is gener¬ 
ally desired. If 0 is convex, stationary points coincide with 
(globally) optimal points and condition 0 is also sufficient 
for y to be (globally) optimal. 

Descent direction. The vector is a descent direction of 
the function /(x) at x = if 

V/(x^)^d^ < 0. (6) 

If ^ is satisfied, the function /(x) can be decreased when x 
is updated from x^ along direction d^. This is because in the 
Taylor expansion of /(x) around x = x^ is given by: 

/(x^ + 7d') = /(x^) + 7V/(x^)^d^ + 0(7), 

where the first order term is negative in view of 0 . For 
sufficiently small 7, the first order term dominates all higher 
order terms. More rigorously, if d^ is a descent direction, there 
exists a 7 ^ > 0 such that m 8 . 2 . 1 ] 

/(x* + yd*) < /(x*), Vy G (0, V)- 

Note that the converse is not necessarily true, i.e., /(x^+^) < 
/(x^) for arbitrary functions /(x) does not necessarily imply 
that x^+^ — x^ is a descent direction of /(x) at x = x^. 

Quasi-convex function. A function h{x.) is quasi-convex if 
for any a G [ 0 , 1 ]: 

h{{l — (t)x + ay) < max(/i(x), /i(y)), Vx, y G A'. 



Figure 1. Relationship of functions with different degree of convexity 


It is strictly convex if the above inequality is satisfied with 
strict inequality whenever x 7 ^ y. It is easy to see that a convex 
function is pseudo-convex. 

Strongly convex functions. A function h{x.) is strongly 
convex with constant a if 

h{y) > h{x.) + V/i(x)^(y - x) + I ||y - x ||2 , Vx,y G V, 

for some positive constant a. The relationship of functions 
with different degree of convexity is summarized in Figure 
where the arrow denotes implication in the direction of the 
arrow. 


III. The Proposed Successive Pseudo-Convex 
Approximation Algorithm 


In this section, we propose an iterative algorithm that 
solves 0 as a sequence of successively refined approximate 
problems, each of which is much easier to solve than the 
original problem Q. e.g., the approximate problem can be 
decomposed into independent subproblems that might even 
exhibit closed-form solutions. 

In iteration t, let /(x;x^) be the approximate function of 
/(x) around the point x^. Then the approximate problem is 


minimize /(x;x^) 

X 

subject to X G A’, 


( 8 ) 


A locally optimal point y of a quasi-convex function h{x.) 
over a convex set A' is also globally optimal, i.e., 

h{x.) > /i(y), Vx G A'. 

Pseudo-convex function. A function /i(x) is pseudo-convex 
if |23l 

Vft.(x)'*’(y - x) > 0 h{y) > h{x), Vx, y G V. 

Another equivalent definition of pseudo-convex functions is 
also useful in our context ||23]| : 

h{y) < /i(x) ^ V/i(x)^(y - x) < 0. (7) 

In other words, h{y) < /i(x) implies that y — x is a descent 
direction of h{x.). A pseudo-convex function is also quasi- 
convex 1231 Th. 9.3.5], and thus any locally optimal points of 
pseudo-convex functions are also globally optimal. 

Convex function. A function /i(x) is convex if 

Hy) > + V/i(x)^(y - x), Vx,y G A". 


and its optimal point and solution set is denoted as ®x^ and 
5(x^), respectively: 

®x^ G S { x ^) = |x* G A' : /(x*;x^) = min /(x;x^)|. (9) 
I xGTf J 

We assume that the approximate function /(x; y) satisfies the 
following technical conditions: 

(Al) The approximate function /(x;y) is pseudo-convex in 
X for any given y G A'; 

(A2) The approximate function /(x; y) is continuously differ¬ 
entiable in X for any given y G A' and continuous in y for 
any x G A’; 

(A3) The gradient of /(x;y) and the gradient of /(x) are 
identical at x = y for any y G A", i.e., Vx/(y; y) = Vx/(y); 

Based on 0, we define the mapping Ex that is used to 
generate the sequence of points in the proposed algorithm: 

A' 3 X I —^ ®x G A'. (10) 

Given the mapping ®x, the following properties hold. 
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Proposition 1 (Stationary point and descent direction). Pro¬ 
vided that Assumptions (A1)-(A3) are satisfied: (i) A point y 
is a stationary point if and only if y ^ S{y) defined in 

0; (ii) If y is not a stationary point o/Q, then By — y is a 
descent direction of f{yi): 

V/(yf (By - y) < 0. (11) 

Proof: See Appendix [A| ■ 

If is not a stationary point, according to Proposition 
we define the vector update in the (t + l)-th iteration as: 

x^+i = + 7 ^(Bx^ - x^), (12) 

where 7 ^ G ( 0 , 1 ] is an appropriate stepsize that can be 
determined by either the exact line search (also known as the 
minimization rule) or the successive line search (also known 
as the Armijo rule). Since x^ G A', Bx^ G A’ and 7 ^ G (0,1], 
it follows from the convexity of A’ that x^+^ G A' for all t. 

Exact line search. The stepsize is selected such that the 
function /(x) is decreased to the largest extent along the 
descent direction Bx^ — x^: 

7 ^ G argmin /(x^ + 7 (Bx^ - x^)). (13) 

0<7<1 

With this stepsize rule, it is easy to see that if x^ is not a 
stationary point, then /(x^+^) < /(x^). 

In the special case that /(x) in § is convex and nulls 
the gradient of /(x^ + 7 (Bx^ — i.e., V^/(x^ + 7 *(Bx^ — 

x^)) = 0 , then 7 ^ in is simply the projection of 7 * onto 
the interval [ 0 , 1 ]: 

1 , if V^/(x*+ 7 (Bx*-x*)) 7 ^ > 0 , 

0, if V^/(x* + 7 (Bx* - x*))l^^o < 0, 

7 ^, otherwise. 

If 0 < 7 ^ = 7 "*" < 1, the constrained optimization problem 
in ( p^ is essentially unconstrained. In some applications it is 
possible to compute 7 * analytically, e.g., if /(x) is quadratic 
as in the LASSO problem (Sec. |IV-C| ). Otherwise, for general 
convex functions, can be found efficiently by the bisection 
method as follows. Restricting the function /(x) to a line 
x^ + 7 (Bx^ — x^), the new function /(x^ + 7 (Bx^ ~ ^^)) is 
convex in 7 1241 . It thus follows that V^/(x^+ 7 (Bx^—x^)) < 
0 if 7 < 7 "*" and V^/(x^ + 7 (Bx^ > 0 if 7 > y'^. Given 

an interval [ 7 iow 5 7 up] containing 7 * (the initial value of 710 W 
and 7 up is 0 and 1 , respectively), set y^id = ( 710 W + 7 up )/2 
and refine 710 W and 7 up according to the following rule: 

7low ~ 7 mid 5 if A 7mid(®^ ^ )) ^ 0? 

7up = 7mid, if V^/(x^ + 7mid(®X^ - x^)) < 0. 

The procedure is repeated for finite times until the gap 7 up — 
7 iow is smaller than a prescribed precision. 

Successive line search. If no structure in /(x) (e.g., con¬ 
vexity) can be exploited to efficiently compute y^ according 
to the exact line search ( pA] ), the successive line search can 
instead be employed: given scalars 0 < a < 1 and 0 < P < 1, 


/ = [7*]o = < 


Algorithm 1 The iterative convex approximation algorithm for 
differentiable problem Q 


Data: t = 0 and x^ G A'. 

Repeat the following steps until convergence: 

SI: Compute Bx^ using (|^. 

S2: Compute y^ by the exact line search ( p^ or the succes¬ 
sive line search tH- 

S3: Update x^+^ according to (12) and set t ^ f + 1. 


the stepsize y^ is set to be y^ = , where rut is the smallest 

nonnegative integer m satisfying the following inequality: 


/(x^ -h /3"^(Bx' - x')) < /(x^) + Q(^"^V/(x')^(Bx' - x'). 

(14) 

Note that the existence of a finite rrit satisfying O is 
always guaranteed if Bx^ — x^ is a descent direction at x^ 
and V/(x^)^(Bx^ — x^) <0 (131, i-^., from Proposition 
inequality GD always admits a solution. 

The algorithm is formally summarized in Algorithm and 
its convergence properties are given in the following theorem. 


Theorem 2 (Convergence to a stationary point). Consider 
the sequence {x^} generated by Algorithm Provided that 
Assumptions (A1)-(A3) as well as the following assumptions 
are satisfied: 

(A4) The solution set S{yf) is nonempty for t = 1, 2,...; 
(A5) Given any convergent subsequence where T C 

{1,2,...}, the sequence {®X*}t£7- is bounded. 

Then any limit point of {x^} is a stationary point 


Proof: See Appendix [B| ■ 

In the following we discuss some properties of the proposed 
Algorithm 

On the conditions (A1)-(A5). The only requirement on 
the convexity of the approximate function /(x;x^) is that it 
is pseudo-convex, cf. (Al). To the best of our knowledge, 
these are the weakest conditions for descent direction meth¬ 
ods available in the literature. As a result, it enables the 
construction of new approximate functions that can often be 
optimized more easily or even in closed-form, resulting in a 
significant reduction of the computational cost. Assumptions 
(A2)-(A3) represent standard conditions for successive convex 
approximation techniques and are satisfied for many existing 
approximation functions, cf. Sec. |III-B[ Sufficient conditions 
for Assumptions (A4)-(A5) are that either the feasible set A' in 
^ is bounded or the approximate function in ^ is strongly 
convex flSl . We show that how these assumptions are satisfied 


in popular applications considered in Sec. IV 


On the pseudo-convexity of the approximate function. 

Assumption (Al) is tight in the sense that if it is not satisfied. 
Proposition may not hold. Consider the following simple 
example: f{x) = x^, where — 1 < x < 1 and the point 
x^ = 0 at iteration t. Choosing the approximate function 
f{x;x^) = which is quasi-convex but not pseudo-convex, 
all assumptions except (Al) are satisfied. It is easy to see that 


Mx^ = — 1 , however 


-x^)\/f{x^) = (-1 - 0 ) - 0 = 0 , 
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and thus Mx^ — is not a descent direction, i.e., inequality 
0 in Proposition is violated. 

On the stepsize. The stepsize can be determined in a more 
straightforward way if /(x;x^) is a global upper bound of 
/(x) that is exact at x = x^, i.e., assume that 
(A 6 ) /(x;x^) > /(x) and /(x^;x^) = /(x^), 
then Algorithm [T] converges under the choice 7 ^ = 1 which 
results in the update x^+^ = Bx^. To see this, we first remark 
that 7 ^ = 1 must be an optimal point of the following problem: 


Algorithm 2 The iterative convex approximation algorithm for 
nondifferentiable problem Q 


Data: t = 0 and x^ G A'. 

Repeat the following steps until convergence: 

SI: Compute Bx^ using ( 22 ). 

S2: Compute 7 ^ by the exact line search (^) or the succes¬ 
sive line search ( [24| ). 

S3: Update x^+^ according to 


= x^ + 7 ^(Bx^ - x^). 


1 G argmin /(x^ + 7 (Bx^ — x^); x^), (15) 

0<7<1 

otherwise the optimality of Bx^ is contradicted, cf. <§• 
At the same time, it follows from Proposition that 
V/(x^; x^)^(Bx^ — x^) < 0. The successive line search over 
/(x^+ 7 (Bx^— x^)) thus yields a nonnegative and finite integer 
rrit such that for some 0 < a < 1 and 0 < < 1 : 

/(Bx^x^) < /(x^ -h - x^);x^) 

< /V) + V/(x^ x')^(Bx' - x') 

= /(x^) -h a/3^* V/(x^)^(Bx^ - x^), (16) 

where the second inequality comes from the definition of 
successive line search [cf. ( p^ ] and the last equality follows 
from Assumptions (A3) and (A 6 ). Invoking Assumption (A 6 ) 
again, we obtain 

/(x‘+i) < /(x‘) + a^™*V/(x‘)^(Bx‘ - . 

(17) 

The proof of Theorem can be used verbatim to prove the 
convergence of Algorithm with a constant stepsize 7 ^ = 1. 

A. Nondifferentiable Optimization Problems 

In the following we show that the proposed Algorithm 
can also be applied to solve problem ([^, and its equivalent 
formulation © which contains a nondifferentiable objective 
function. Suppose that /(x; x^) is an approximate function of 
/(x) in ^ around x^ and it satisfies Assumptions (A1)-(A3). 
Then the approximation of problem ^ around (x^,^^) is 

(Bx*,y*(x*)) 4 argmin /(x;x*) + 2 /. (18) 

(x,i/):xeA’,^(x)<y 

That is, we only need to replace the differentiable function 
/(x) by its approximate function /(x;x^). To see this, it is 
sufficient to verify Assumption (A3) only: 

Vx(/(x*; X*) + y) = Vx(/(x*) + y*), 

Vy(/(x*; X*) +y) = Vy{f{x*) + y) = l. 

Based on the exact line search, the stepsize in this case is 
given as 

7 * G argmin{/(x*+ 7 (Bx*-x*))+y‘+ 7 (y*(x*)-y*))}, (19) 
0<7<1 


Set t i — f “h 1. 


The convergence of Algorithm with (Bx*,y*(x*)) and 7 * 
given hy ([T8l)-l|19|l directly follo ws from Theorem 
The point given in ( |20b| ) can he further refined: 

/(x*+i) + = /(x‘+i) + y‘ + 7*(l/*(x*) - y*) 

> /(x‘+i) + y(x‘) + 7 *( 5 (Bx*) - y(x*)) 

> /(x*+i) + y((l - 7*)x* + 7 *Bx*) 

= /(x‘+i)+y(x‘+i), 

where the first and the second inequality comes from the fact 
that > ^(x^) as well as = ^(Bx^) and Jensen’s 

inequality of convex functions ^(x) 1^ . respectively. Since 
yt-\-i ^ ^(x^+^) by definition, the point (x^+^,^(x^+^)) 
always yields a lower value of /(x) + y than (x^+^,^^+^) 
while (x^+^, ^(x^+^)) is still a feasible point for problem 
The update ( [20b| ) is then replaced by the following enhanced 
rule: 

( 21 ) 


y 


t+i _ 




Algorithm with Bx^ given in ( |20a| ) and y^~^^ given in (21) 
still converges to a stationary point of 0 - 

The notation in (p^-(p^ can be simplified by removing the 
auxiliary variable y: Bx^ in ( p^ can be equivalently written 
as 

Bx^ = argmin{/(x; x^) + ^(x)} (22) 

xG-V 

and combining ( p^ and ^2T\ yields 

7 ^ G argmin{/(x^+ 7 (Bx^— x^))+ 7 (^(Bx^)—^( x^))}. (23) 
0<7<1 

In the context of the successive line search, customizing 
the general definition ( p^ for problem ^ yields the choice 
7 ^ = being the smallest integer that satisfies the 

inequality: 


/(x^ -h /3^(Bx^ - x^)) - /(x^) < 


/3™(aV/(x*) 


t\Tf 




■x‘)+(q;- l)(y(Bx‘) 


9(x‘))). 

(24) 


where y^ > g{y^). Then the variables x^+^ and y^^^ are 
defined as follows: 


= x^ + 7 ^(Bx^ - x^), ( 20 a) 

y*+i = y* + 7 *(y*(x*) - y*). ( 20 b) 


Based on the derivations above, the proposed algorithm for 
the nondifferentiable problem ^ is formally summarized in 
Algorithm 

It is much easier to calculate according to (23) than in 
state-of-the-art techniques that directly carry out the exact line 
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search over the original nondifferentiable objective function in 
0 EH Rule E], i.e., 


- x^)) + g{x^ + 7 (Bx^ - x^)). 


min f(x^ + 

0<7<1 

This is because the objective function in ( [2^ is differentiable 
in 7 while state-of-the-art techniques involve the minimization 
of a nondifferentiable function. If /(x) exhibits a specific 
structure such as in quadratic functions, 7 ^ can even be 
calculated in closed-form. This property will be exploited 
to develop fast and easily implementable algorithm for the 


popular LASSO problem in Sec. IV-C 


In the proposed successive line search, the left hand side 


of (24) depends on /(x) while the right hand side is linear 
in The proposed variation of the successive line search 
thus involves only the evaluation of the differentiable function 
/(x) and its computational complexity and signaling exchange 
(when implemented in a distributed manner) is thus lower than 
state-of-the-art techniques (for example (23 Rule A’], IZTl 
Equations (9)-(10)], (13 Remark 4] and (28| Algorithm 2.1]), 
in which the whole nondifferentiable function /(x) + g{x.) 
must be repeatedly evaluated (for different m) and compared 
with a certain benchmark before rrit is found. 


B . Special Cases and New Algorithms 

In this subsection, we interpret some existing methods in the 
context of Algorithm and show that they can be considered 
as special cases of the proposed algorithm. 

Conditional gradient method: In this iterative algorithm 
for problem the approximate function is given as the first- 
order approximation of /(x) at x = x^ (TS] Sec. 2.2.2], i.e., 

/(x;x‘) = V/(x‘7(x-x‘). (25) 

Then the stepsize is selected by either the exact line search or 
the successive line search. 

Gradient projection method: In this iterative algorithm for 
problem ([^, ®x^ is given by (TSl Sec. 2.3] 

®x'= [x'-s'V/(x')]^, 

where > 0 and [x]^ denotes the projection of x onto A'. 
This is equivalent to defining /(x;x^) in ([^ as follows: 

/(x; X*) = V/(x*)^(x - X*) + ^ ||x - x‘||J , (26) 

which is the first-order approximation of /(x) augmented by a 
quadratic regularization term that is introduced to improve the 
numerical stability El. A generalization of ( [^ is to replace 
the quadratic term by (x —x^)H^(x —x^) where 0 1271 . 

Proximal gradient method: If /(x) is convex and has a 
Lipschitz continuous gradient with a constant L, the proximal 
gradient method for problem ^ has the following form (HI 
Sec. 4.2]: 

x‘+i = argmin {s7(x) + ^||x - (x* - s‘V/(x*))|f } 

= argmin |v/(x*)(x - x*) + ^||x-x*||^ +fi'(x)|. 

( 27 ) 


where 5 ^ > 0. In the context of the proposed framework (22), 
the update (27) is equivalent to defining /(x;x^) as follows: 


/(x; X*) = V/(x)^(x - x‘) + ^ ||x - x^ 


(28) 


and setting the stepsize 7 ^ = 1 for all t. According to Theorem 
[^and the discussion following Assumption (A6), the proposed 
algorithm converges under a constant unit stepsize if /(x; x^) 
is a global upper bound of /(x), which is indeed the case when 
< 1/L in view of the descent lemma (TSl Prop. A.24]. 
Jacobi algorithm: In problem ([^, if /(x) is convex in each 
x/c where /c = 1,..., (but not necessarily jointly convex in 
(xi,..., xx)), the approximate function is defined as m 

/(x;x‘) = Ef=i(/(xfc,x*_j^) + ^ \\xk-A\\l), (29) 

where Tk > 0 for /c = 1,..., AT. The k-th component function 
/(x/c,xi^) -h ^ ||x/c — x ^||2 in ( |29[ ) is obtained from the 
original function /(x) by fixing alTvariables except xj., i.e., 
x_/c = x^^, and further adding a quadratic regularization 
term. Since /(x;x^) in (29) is convex. Assumption (Al) is 
satisfied. Based on the observations that 


Vxfc/(x*;x*) = Vx7/(xfe,x*_j.) + 7 ||xfe - 
= Vxfc/(xfe,x‘_j.) + Tfe(xfe - : 

= Vx./(V), 


X 


^ II2 


4) t 

IXfc=X^ 


) lxfc=x 


we conclude that Assumption (A3) is satisfied by the choice of 
the approximate function in ( [29| )- The resulting approximate 
problem is given by 


minimize 


Ef=l(/(Xfc,X*_fc) + 7 
subject to X G A'. 


|Xfe -x^f) 


(30) 


This is commonly known as the Jacobi algorithm. The struc¬ 
ture inside the constraint set A', if any, may be exploited 


to solve (30) even more efficiently. Eor example, the con¬ 
straint set A' consists of separable constraints in the form of 
^/c(x/c) < 0 for some convex functions hk{'^k)- Since 
subproblem ( [^ is convex, primal and dual decomposition 
techniques can readily be used to solve ( [^ efficiently (^ 
(such an example is studied in Sec. |IV-A| ). 

To guarantee the convergence, the condition proposed in 
ii is that T/c > 0 for all k in ( [^ unless /(x) is strongly 
convex in each x/.. However, the strong convexity of /(x) 
in each x/. is a strong assumption that cannot always be 
satisfied and the additional quadratic regularization term that is 
otherwise required may destroy the convenient structure that 
could otherwise be exploited, as we will show through an 
example application in the MIMO BC in Sec. |IV-A| In the 
case T/c = 0, convergence of the Jacobi algorithm ( [^ is only 
proved when /(x) is jointly convex in (xi,..., x^^) and the 
stepsize is inversely proportional to the number of variables 
K (21 [TOl |T3l, namely, 7 ^ = 1/K. However, the resulting 
convergence speed is usually slow when K is large, as we 
will later demonstrate numerically in Sec. |IV-A[ 

With the technical assumptions specified in Theorem the 
convergence of the Jacobi algorithm with the approximate 
problem ( [^ and successive line search is guaranteed even 
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Algorithm 3 The Jacobi algorithm for problem 0 

Data: t = 0 and G A/, for all /c = 1,..., if. 
Repeat the following steps until convergence: 


SI: For /c = 1,..., if, compute using (31). 

S2: Compute 7 ^ by the exact line search ( jB ) or the succes¬ 
sive line search 

S3: Update according to 


X 


t+i _ 
k — 


7 ^(B/,x^ -x^),VA: = l,...,if. 


Set t i — f H“ 1. 


when Tk = 0. This is because /(x;x^) in (29) is already 


convex when = 0 for all k and it naturally satisfies the 
pseudo-convexity assumption specified by Assumption (Al). 

In the case that the constraint set A' has a Cartesian product 
structure the subproblem ( [^ is naturally decomposed into 
K sub-problems, one for each variable, which are then solved 
in parallel. In this case, the requirement in the convexity of 
/(x) in each x/^ can even be relaxed to pseudo-convexity 
only (although the sum function Ylk=i is not 

necessarily pseudo-convex in x as pseudo-convexity is not 
preserved under nonnegative weighted sum operator), and this 
leads to the following update: Bx^ = (B/^x^)^^ and 


G argmin/(xfe,x*_j.), k = l,. 


,if, 


(31) 


where 
to other 


x^ can be interpreted as variable x/^’s best-response 


variables x_/c = (xj)^^/^ when x_/c = x^ 


-k' 


The proposed Jacobi algorithm is formally summarized in 
Algorithm and its convergence is proved in Theorem 

Theorem 3. Consider the sequence {x^} generated by Algo¬ 
rithm 1^ Provided that /(x) is pseudo-convex in x/^ for all 
k = 1,..., if and Assumptions (A4)-(A5) are satisfied. Then 
any limit point of the sequence generated by Algorithm is a 
stationary point 

Proof: See Appendix [C| ■ 

The convergence condition specified in Theorem relaxes 
those in |[8l[T9l: /(x) only needs to be pseudo-convex in each 
variable x/^ and no regularization term is needed (i.e., Tk = 0 ). 
To the best of our knowledge, this is the weakest convergence 
condition on Jacobi algorithms available in the literature. We 
will show in Sec. IV-B| by an example application of the energy 
efficiency maximization problem in massive MIMO systems 
how the weak assumption on the approximate function’s 
convexity proposed in Theorem can be exploited to the 
largest extent. Besides this, the line search usually yields much 
faster convergence than the fixed stepsize adopted in O even 
under the same approximate problem, cf. Sec. |IV-A| 

DC algorithm: If the objective function in ^ is the 
difference of two convex functions /i(x) and / 2 (x): 

/(x) = /l(x) - / 2 (x), 

the following approximate function can be used: 

/(x;x*) = /i(x) - (/ 2 (x‘) + V/ 2 (x*)^(x - X*)). 


Since / 2 (x) is convex and / 2 (x) > / 2 (x^) + V/ 2 (x^)^(x — 
x^). Assumption (A 6 ) is satisfied and the constant unit stepsize 
can be chosen ll^ . 


IV. Example Applications 
A. MIMO Broadcast Channel Capacity Computation 

In this subsection, we study the MIMO BC capacity com¬ 
putation problem to illustrate the advantage of the proposed 
approximate function. 

Consider a MIMO BC where the channel matrix charac¬ 
terizing the transmission from the base station to user k is 
denoted by Hk, the transmit covariance matrix of the signal 
from the base station to user k is denoted as Q/^, and the 
noise at each user k is an additive independent and identically 
distributed Gaussian vector with unit variance on each of its 
elements. Then the sum capacity of the MIMO BC is OTI 

maximize log|l + Ef=iHfeQfeHf | 

{Qfc} 

subject to QkhO, k = l,...,K, < P, (32) 

where P is the power budget at the base station. 

Problem ( [^ is a convex problem whose solution cannot be 
expressed in closed-form and can only be found iteratively. To 
apply Algorithmic we invoke ([29|)-([^ and the approximate 
problem at the t-th iteration is 

maximize Y.k=i log|R-fc(Q-fe) + HfcQfeHf | 

{Qfc} 

subject to Q/c ^ 0, /c = 1,..., if, Z]f=itr(Q/c) < P, 

(33) 


where 'Rk{Q^_k) = I + . The approximate 

function is concave in Q and differentiable in both Q and 
Q^, and thus Assumptions (A1)-(A3) are satisfied. Since the 
constraint set in ( [3^ is compact, the approximate problem 
(33) has a solution and Assumptions (A4)-(A5) are satisfied. 

Problem ( [^ is convex and the sum-power constraint 
coupling Qi,...,Qk is separable, so dual decomposition 
techniques can be used 1291 . In particular, the constraint set 
has a nonempty interior, so strong duality holds and ( [^ can 
be solved from the dual domain by relaxing the sum-power 
constraint into the Lagrangian 


BQ^ 


arg max 
(Qfc— 


Ef=ilog|Rfc(Q7) + HfeQfeHf 

-A*(Ef=itr(Qfc)-P) 


(34) 

where BQ^ = (^kQ^)^=i and A* is the optimal Lagrange 
multiplier that satisfies the following conditions: A* > 0 , 
Ef=i tr(BfeQ*) - P < 0, A*(Ef=i tr(BfeQ‘) - P) = 0, and 
can be found efficiently using the bisection method. 

The problem in ( [34| ) is uncoupled among different variables 
Q/c in both the objective function and the constraint set, so it 
can be decomposed into a set of smaller subproblems which 
are solved in parallel: BQ^ = {MkQ^)k=i 


BfcQ* = argmax{log |Rfc(Q‘_j.) + | -A*tr(Qfc)}, 

QfcGo 

(35) 
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and B/cQ^ exhibits a closed-form expression based on the 
waterfilling solution O . Thus problem ( [3^ also has a closed- 
form solution up to a Lagrange multiplier that can be found 
efficiently using the bisection method. With the update direc¬ 
tion BQ^ — Q^, the base station can implement the exact line 
search to determine the stepsize using the bisection method 
described after ( p^ in Sec. |I^ 

We remark that when the channel matrices H/. are rank 
deficient, problem ( [33] ) is convex but not strongly convex, 
but the proposed algorithm with the approximate problem 
(|3^ still converges. However, if the approximate function in 


|8i| is used [cf. (29)], an additional quadratic regularization 
term must be included into ( [3^ (and thus ( [^ ) to make 
the approximate problem strongly convex, but the resulting 
approximate problem no longer exhibits a closed-form solution 
and thus are much more difficult to solve. 

Simulations. The parameters are set as follows. The number 
of users is if = 20 and K = 100, the number of transmit 
and receive antenna is (5,4), and P = 10 dB. The simulation 
results are averaged over 20 instances. 

We apply Algorithm with approximate problem ( [3^ and 
stepsize based on the exact line search, and compare it with the 
iterative algorithm proposed in O [TSl, which uses the same 
approximate problem (33) but with a fixed stepsize 7 ^ = 1/K 
(K is the number of users). It is easy to see from Figure [^ 
that the proposed method converges very fast (in less than 
10 iterations) to the sum capacity, while the method of ll^ 
requires many more iterations. This is due to the benefit of 
the exact line search applied in our algorithm over the fixed 
stepsize which tends to be overly conservative. Employing the 
exact line search adds complexity as compared to the simple 
choice of a fixed stepsize, however, since the objective function 
of ( [ 3 ^ is concave, the exact line search consists in maximizing 
a differentiable concave function with a scalar variable, and 
it can be solved efficiently by the bisection method with 
affordable cost. More specifically, it takes 0.0023 seconds 
to solve problem ( [3^ and 0.0018 seconds to perform the 
exact line search (the software/hardware environment is further 
specified in Sec. |IV-C|). Therefore, the overall CPU time 


(time per iteration x number of iterations) is still dramatically 
decreased due to the notable reduction in the number of 
iterations. Besides, in contrast to the method of m, increasing 
the number of users K does not slow down the convergence, 
so the proposed algorithm is scalable in large networks. 

We also compare the proposed algorithm with the itera¬ 
tive algorithm of 1201 . which uses the approximate problem 
(33) but with an additional quadratic regularization term, cf. 
{2^, where Tk = 10“^ for all k, and decreasing stepsizes 


= 7^(1 —dy^) where d = 0.01 is the so-called decreasing 
rate that controls the rate of decrease in the stepsize. We 
can see from Figure that the convergence behavior of 
1^ is rather sensitive to the decreasing rate d. The choice 
d = 0.01 performs well when the number of transmit and 
receive antennas is 5 and 4, respectively, but it is no longer a 
good choice when the number of transmit and receive antenna 
increases to 10 and 8, respectively. A good decreasing rate d is 
usually dependent on the problem parameters and no general 
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Figure 3. MIMO BC: error e(Q*) = 3J(tr(V/(Q*)(BQ* - Q*))) versus 
the number of iterations. 


rule performs equally well for all choices of parameters. 

We remark once again that the complexity of each iteration 
of the proposed algorithm is very low because of the exis¬ 
tence of a closed-form solution to the approximate problem 
( |3^ , while the approximate problem proposed in 1201 does 
not exhibit a closed-form solution and can only be solved 
iteratively. Specifically, it takes CVX (version 2.0 021^21.1785 
seconds (based on the dual approach p5] i where A* is found 
by bisection). Therefore, the overall complexity per iteration 
of the proposed algorithm is much lower than that of 1 ^ . 

B. Energy Efficiency Maximization in Massive MIMO Systems 

In this subsection, we study the energy efficiency maximiza¬ 
tion problem in massive MIMO systems to illustrate the advan¬ 
tage of the relaxed convexity requirement of the approximate 
function /(x;x^) in the proposed iterative optimization ap¬ 
proach: according to Assumption (Al), /(x; x^) only needs to 
exhibit the pseudo-convexity property rather than the convexity 
or strong convexity property that is conventionally required. 
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Consider the massive MIMO network with K cells and each 
cell serves one user. The achievable transmission rate for user 
k in the uplink can be formulated into the following general 
fornfl 


r/e(p) = log 1 + 


"^kkPk 


H“ 4^kPk ^kjPj 


(36) 


where pk is the transmission power for user k, is the 
covariance of the additive noise at the receiver of user k, 
while (pk and {wkj}k,j are positive constants that depend on 
the channel conditions only. In particular, <pkPk accounts for 
the hardware impairments, and ^j^k ^kjPj accounts for the 
interference from other users Ea. 

In 5G wireless communication networks, the energy effi¬ 
ciency is a key performance indicator. To address this issue, 
we look for the optimal power allocation that maximizes the 
energy efficiency: 


maximize 

p 

subject to 


Etirkip) 

Pc + Pk 

p < p < p, 


(37) 


where Pc is a positive constant representing the total cir¬ 
cuit power dissipated in the network, p = {PjJk=i 
P = {Pk)k=i specifies the lower and upper bound constraint, 
respectively. 

Problem ( [TT] ) is nonconvex and it is a NP-hard problem to 
find a globally optimal point O. Therefore we aim at finding 
a stationary point of ( [TT] ) using the proposed algorithm. To 
begin with, we propose the following approximate function at 

P = P*: 




■ "^j^k Pj 


(38) 


where 


TkiPk^P ) =rk{Pk.P-k)P^{rj{p )p{Pk-Pk)'^Pk'^j{P ))• 

(39) 

and 


Vp,rj(p^)) = 


Wjk 


A= 
Wjk 


d] + (pjPj + E/=i ^jiPi 


-h (l)jPj + Wjipi 

Note that the approximate function /(p; p^) consists of K 
component functions, one for each variable pk, and the k- 
th component function is constructed as follows: since r/c(p) 
is concave in pk (shown in the right column of this page) 
but rj{p) is not concave in pk (as a matter of fact, it 
is convex in pk), the concave function rk{pkiP^-k) 


served in rk{Pk'-,P^) in (39) with p-k fixed to be 


while the nonconcave functions {rj(p)} are linearized w.r.t. 
Pk at p = p^. In this way, the partial concavity in the 
nonconcave function Ylf=i'kj{p) is preserved in /(p;p^). 


^We assume a single resource block. However, the extension to multiple 
resource blocks is straightforward. 


Similarly, since Pc + Ylf=i i^ denominator is linear 
in Pk, we only set p-k = P^-k- division 

operator in the original problem ( [J7| ) is kept in the approximate 
function /(p; p^) (38). Although it will destroy the concavity 
(recall that a concave function divided by a linear function 
is no longer a concave function), the pseudo-concavity of 
rk{Pk;P*)/{Pc+Pk + J2j^kPj) is siiil preserved, as we show 
in two steps. 

Step 1: The function rk{pk^P^-k) concave in pk. For 
the simplicity of notation, we define two constants ci = 
Wkk/(t>k > 0 and C 2 = {(^l+Y^j^k WkjpJ'^)/<l>k > 0. The first- 
order derivative and second-order derivative of rk{pk-,P^-k) 
w.r.t. Pk are 

1 + Cl 1 


^p^rkiPk^P-k) 


(1- 


^pMPk^P-k) = 


Cl)pk + C2 
(1 + Cl)^ 


Pfe + C2 ' 


-h 


Cl)pk + C2Y (Pfe + C2)2 

-2ciC2Pfc(l -h Cl) - (cf -h 2ci)c| 
{{l + Ci)pk+C2Y{Pk + C2Y 


Since Vljk{Pk,ptk) < 0 when p > 0, rk{pk,ptk) is a 
concave function of pk in the nonnegative orthant p/c > 0 

El. 

Step 2: Given the concavity of rk{pk^P^-k)^ l^e 
functionf/c(p/c; p^) is concave in pk. Since the denominator 
function of fk{Pk'’>P^) is a linear (and thus convex) function 
of Pk, it follows from O Lemma 3.8] that rk{Pk] P^)/{Pc + 
Pk + Y^j^kP^j) pseudo-concave. 

Then we verify that the gradient of the approximate function 
and that of the original objective function are identical at p = 
p^. It follows that 


Vpfc/(p;p*) 


' Pk 


4(pfc;p‘) 


Pc+Pk + T.j^kP] 


K 

^ Pk' J 

i=i 


Vpfcrj(p*)(Pc + l^P*) - rj(p*) 


Pk=P'k 

•i( 


(Pc + l^p‘)2 


= V 


Pk 


' Ef=ir,(p) 
+ Yj = l Pj 


^k. 


Therefore Assumption (A3) in Theorem is satisfied. As¬ 
sumption (A2) is also satisfied because both rk{Pk]P^) and 
P/c + Pc + Yj^kP] continuously differentiable for any 

> 0 . 

Given the approximate function ( [3^ , the approximate prob¬ 
lem in iteration t is thus 


Bp^ = arg max 

p<p<p 


E 


^fc(pfc;p*) 

Pc + Pk + T,j^kPy 


(40) 


Assumptions (A4) and (A5) can be proved to hold in a similar 
procedure shown in the previous example application. Since 
the objective function in ( [J7] ) is nonconcave, it may not be 
computationally affordable to perform the exact line search. 
Instead, the successive line search can be applied to calculate 
the stepsize. As a result, the convergence of the proposed 
algorithm with approximate problem ( [40| ) and successive line 
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PkiK 


t,r\ 


intfe(p*) 


20/c( tT/c( p^) Xf^ )(0/c “1“ '^/c/c) 


Pk 


(43) 


search follows from the same line of analysis used in the proof 
of Theorem O 

The optimization problem in ( |40| ) can be decomposed into 
independent subproblems ( [4T] ) that can be solved in parallel: 

TO t rk{pk;p^) . , //l1^ 

MkP = argmax -- k = l,...,K, (41) 

P^<Pk<Pk + P/c + ^jj^kPj 

where Bp^ = {^kP^)^=i- As we have just shown, the 
numerator function and the denominator function in is 
concave and linear, respectively, so the optimization problem 
in ( [4T] ) is a fractional programming problem and can be solved 
by the Dinkelbach’s algorithm (331 Algorithm 5]: given 
(A^’^ can be set to 0), the following optimization problem in 
iteration r + 1 is solved: 


Pk{>^k^) = argmax VkiPk^P^) - A^’^(Pc +p/c + T^j^kPj)^ 

Pj,<Pk<Pk 

(42a) 

where rk{pk]P^) is the numerator function in (41). The 
variable AI’^ is then updated in iteration r + 1 as 




-Pc +PA:(A^’^) + 


(42b) 


It follows from the convergence properties of the Dinkelbach’s 
algorithm that 


lim Pfe(A^’^) = Bfep* 


at a superlinear convergence rate. Note that problem ( [42a| ) 
can be solved in closed-form, as P/c(A^’^) is simply the 
projection of the point that sets the gradient of the objective 
function in (42a) to zero onto the interval [Pj^^Pk]- H can 
be verified that finding that point is equivalent to finding 
the root of a polynomial with order 2 and it thus admits 
a closed-form expression. We omit the detailed derivations 
and directly give the expression of P/c(A^’^) in Um at the 
top of this page, where 7rk{p^) = J2jj^k"^Pk'^jiP^) 


intfc(p*) = cr| + Yjjtk'^kjPj- 
We finally remark that the approximate function in ( [38] ) is 
constructed in the same spirit as la in Ea by keeping as 
much concavity as possible, namely, rk{p) in the numerator 
and Pc^Ylf=i Pj denominator, and linearizing the non¬ 

concave functions only, namely, ^j^k (p) numerator. 
Besides this, the denominator function is also kept. Therefore, 
the proposed algorithm is of a best-response nature and ex¬ 
pected to converge faster than gradient based algorithms which 
linearizes the objective function Ylf=i (p)/(^c + Ylf=i Pj) 
in completely. However, the convergence of the proposed 
algorithm with the approximate problem given in ( |40| cannot 
be derived from existing works, since the approximate function 


presents only a weak form of convexity, namely, the pseudo¬ 
convexity, which is much weaker than those required in state- 
of-the-art convergence analysis, e.g., uniform strong convexity 
in (SI. 

Simulations. The number of antennas at the BS in each 
cell is M = 50, and the channel from user j to cell 
k is h.kj ^ We assume a similar setup as 


h.^iMkk^, Wkj = \h^k^kj\^ + ^^kk^j^kk for 

= rh|^D/ch/c/c, 


'^kk = \^^k\^^kk 

j k and (j)k = eh.^jJ^khkk^ where e = 0.01 is the 
error magnitude of hardware impairments at the BS and 
Dj = diag({|/ijj(m)p}^^i). The noise covariance = 1, 
and the hardware dissipated power pc is lOdBm, while p^ 
is -lOdBm and pj^ is lOdBm for all users. The benchmark 
algorithm is |33l Algorithm 1], which successively maximizes 
the following lower bound function of the objective function 
in ([T^, which is tight at p = p^: 


Yk=lK + 0'PogWkk 


maximize 

q 


Ef=ie«^ 


+ 


Ef=i aiiqk - log(cr| + + Yjjik '^kje‘‘^)) 


Pc + Yk=l 6'^'“ 

subject to log(p ) <qk< log(Pfe), k = l,...,K, 


(44) 


where 


ai = 


sinrfc(p*) 
l + sinrfe(p*)’ 


bk =log(l + sinrfe(p )) - - 


sinrfc(p*) 
+ sinrfc(p* 


-log(sinrfe(p*)), 


and 


sinr/,(p) = 


WkkP' 


crl + (t>kPk + ^kjPj ' 


Denote the optimal variable of (|44|) as (which can be 
found by the Dinkelbach’s algorithm); then the variable p is 
updated as for all k = 1,..., if. We thus coin 

(33] Algorithm 1] as the successive lower bound maximization 
(SLBM) method. 

In Figure [^ we compare the convergence behavior of the 
proposed method and the SLBM method in terms of both the 
number of iterations (the upper subplots) and the CPU time 
(the lower subplots), for two different number of users: K = 
10 in Figure]^ (a) and if = 50 in Figure |^(b). It is obvious that 
the convergence speed of the proposed algorithm in terms of 
the number of iterations is comparable to that of the SLBM 
method. However, we remark that the approximate problem 


(40) of the proposed algorithm is superior to that of the SLBM 
method in the following aspects: 

Firstly, the approximate problem of the proposed algorithm 
consists of independent subproblems that can be solved in 
parallel, cf. while each subproblem has a closed-form 
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Figure 4. Energy Efficiency Maximization: achieved energy efficiency versus the number of iterations 


solution, cf. (|45-(|4^. However, the optimization variable in 
the approximate problem of the SLBM method ( |44| ) is a vector 
q G and the approximate problem can only be solved 

by a general purpose solver. 

In the simulations, we use the Matlab optimization toolbox 
to solve {4^ and the iterative update specified in ([4^-([43)) to 
solve (pO), where the stopping criterion for (42) is < 

10“^. Tne upper subplots in Figure show mat the numbers 
of iterations required for convergence are approximately the 
same for the SLBM method when K = 10 in Figure (a) 
and when K = 50 in Figure |^(b). However, we see from the 
lower subplots in Figure]^ that the CPU time of each iteration 
of the SLBM method is dramatically increased when K is 
increased from 10 to 50. On the other hand, the CPU time 
of the proposed algorithm is not notably changed because the 
operations are parallelizabl^ and the required CPU time is 
thus not affected by the problem size. 

Secondly, since a variable substitution qk = is adopted 
in the SLBM method (we refer to 13^ for more details), the 
lower bound constraint = 0 (which corresponds to qk = 
—oo) cannot be handled by the SLBM method numerically. 
This limitation impairs the applicability of the SLBM method 
in many practical scenarios. 


sparse signal recovery EH [El EH ED: 

minimize | ||Ax - b|| 2 -l-||x||;^, (45) 

where A e (with N <C K), b e and fi > 

0 are given parameters. Problem is an instance of the 
general problem structure defined in ^ with the following 
decomposition: 

/(x) = i ||Ax-b|| 2 , and 5 ((x) =/u ||x||i. (46) 

Problem ( [45] ) is convex, but its objective function is non- 
differentiable and it does not have a closed-form solution. To 
apply Algorithm]^ the scalar decomposition x = {xk)k=i is 
adopted. Recalling ( [22] ) and ( |29| ), the approximate problem is 


lx* = argmin {Efe=i/(a;fe,+ 5 r(x)}. (47) 


Note that g{x.) can be decomposed among different compo¬ 
nents of X, i.e., g{x.) = so the vector problem 

(47) reduces to K independent scalar subproblems that can be 
solved in parallel: 


C. LASSO 

In this subsection, we study the LASSO problem to illus¬ 
trate the advantage of the proposed line search method for 
nondifferentiable optimization problems. 

LASSO is an important and widely studied problem in 

^By stacking the pfc(A^’^)’s into the vector form p(A^’'^) = 

we can see that only element wise operations between vectors 
and matrix vector multiplications are involved. The simulations on which 
Figureare based are not performed in a real parallel computing environment 
with K processors, but only make use of the efficient linear algebraic 
implementations available in Matlab which already implicitly admits a certain 
level of parallelism. 


BfcX* = argmin {f{xk,xtk)+g{xk)} 

= 4(A^A)“^5^(r/c(x^)), /c = 1,..., AT, 

where dk{A^A) is the k-th diagonal element of A^A, 
‘5a(b) = [b — a]^ — [— b — a]^ is the so-called soft- 

thresholding operator Ell and 

r(x) = d(A^A) o X - A^(Ax - b) , (48) 

or more compactly: 

Bx* = (BfcX*)f=i = d(A'^A)“^ o 5^1 (r(x*)). (49) 
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Thus the update direction exhibits a closed-form expression. 
The stepsize based on the proposed exact line search ( [2^ is 

7 ^ = arg min {/(x^ + 7 (Bx^ — x^)) + — 5 f(x^))} 

0 < 7<1 

. fi||A(x* + 7(Bx‘-x*))-b||2l 

= arg mm < ^ , x ^ / 

0<7<i \ +7Ai(l|Bx*||i - ||x*||i) j 

(Ax* - b)^A(Bx* - X*) + /i(||Bx*||i - ||x*||71 ^ 
(A(Bx* - x*))^(A(Bx* - X*)) Jo‘ 

(50) 


The exact line search consists in solving a convex quadratic 
optimization problem with a scalar variable and a bound 
constraint, so the problem exhibits a closed-form solution 
( [50| . Therefore, both the update direction and stepsize can be 
calculated in closed-form. We name the proposed update ( [49] )- 
( 50| ) as Soft-Thresholding with Exact Line search Algorithm 
(STELA). 

The proposed update (|49|)-(|50|) has several desirable features 
that make it appealing in practice. Firstly, in each iteration, all 
elements are updated in parallel based on the nonlinear best- 
response ( |4^ . This is in the same spirit as |[T9j [38l and the 
convergence speed is generally faster than BCD 1^ or the 
gradient-based update ll40l . Secondly, the proposed exact line 
search ( [^ not only yields notable progress in each iteration 
but also enjoys an easy implementation given the closed-form 
expression. The convergence speed is thus further enhanced 
as compared to the procedures proposed in |[T9j [271 [38l where 
either decreasing stepsizes are used C9l or the line search is 
over the original nondifferentiable objective function in (|45) 
|27l|38l: 


i||A(x*+ 7 (Bx*-x*))-b ||2 


0<7<1 I +/i ||x^ + 


Computational complexity. The computational overhead 
associated with the proposed exact line search can 

significantly be reduced if ( [5Q| is carefully implemented as 
outlined in the following. The most complex operation in ( |50| 
is the matrix-vector multiplication, namely, Ax^ — b in the 
numerator and A(Bx^ — x^) in the denominator. On the one 
hand, the term Ax^ — b is already available from r(x^), which 
is computed in order to determine the best-response in ( |4^ . On 
the other hand, the matrix-vector multiplication A(Bx^ — x^) 
is also required for the computation of Ax^+^ — b as it can 
alternatively be computed as: 


Ax^+^ - b = A(x^ + 7 ^(Bx^ - x^)) - b 

= (Ax^ - b) 4 - 7 ^A(Bx^ - x^), (51) 


where then only an additional vector addition is involved. As 
a result, the stepsize ( |50| does not incur any additional matrix- 
vector multiplications, but only affordable vector-vector mul¬ 
tiplications. 

Signaling exchange. When A cannot be stored and pro¬ 
cessed by a centralized processing unit, a parallel architecture 
can be employed. Assume there are P + 1 (P > 2 ) processors. 



Figure 5. Operation flow and signaling exchange between local processor 
p and the central processor. A solid line indicates the computation that is 
locally performed by the central/local processor, and a solid line with an 
arrow indicates signaling exchange between the central and local processor 
and the direction of the signaling exchange. 


We label the first P processors as local processors and the last 
one as the central processor, and partition A as 


A= [Ai, Aa, ... ,Ap], 

where Ap e and ~ Matrix Ap is stored 

and processed in the local processor p, and the following 
computations are decomposed among the local processors: 


Ax = Ep=iApXp, 

(52a) 

(Ax - b) = (Aj(Ax - , 

(52b) 

d(A^A) = (d(AjA7)7i. 

(52c) 


where x^ G 


. The central processor computes the best- 
response Bx^ in (49) and the stepsize 7 ^ in (^). The decom¬ 
position in (^) enables us to analyze the signaling exchange 
between local processor p and the central processor involved 


in (49) and ( 50f 


The signaling exchange is summarized in Figure Firstly, 
the central processor sends Ax^ — b to the local processors 
(Sl.lfl and each local processor p for p = 1,..., P first 
computes Aj (Ax^ — b) and then sends it back to the c entral 
processor (S1.2), which forms A^(Ax^—b) (S1.3) as in ( 52b ) 
and calculates r(x^) as in ( [i^ (S1.4) and then Bx^ as in (49) 
(S1.5). Then the central processor sends Bx^ — x^ to the local 
processor p for p = 1 ,..., P (S2.1), and each local processor 
first computes Ap(Bx^ — xp and then sends it back to the 
cen tral p rocessor (S2.2), which forms A(Bx^ — x^) (S2.3) as 
in ( 52a ), calculates 7 ^ as in ( [^ (S2.4), and updates x^+^ 
(S3.1) and Ax^+^ — b (S3.2) according to (51). From Figure 


^Updates and |5o) can also be implemented by a parallel architecture 
without a central processor. In this case, the signaling is exchanged mutually 
between every two of the local processors, but the analysis is similar and the 
conclusion to be drawn remains same: the proposed exact line search {50) 
does not incur additional signaling compared with predetermined stepsizes. 
is set to = 0, so Ax° — b = —b. 

































































13 



number of iterations 

Figure 6. Convergence of STELA (proposed) and FLEXA (state-of-the-art) 
for LASSO: error versus the number of iterations. 


we observe that the exact line search ( [50| ) does not incur 
any additional signaling compared with that of predetermined 
stepsizes (e.g., constant and decreasing stepsize), because the 
signaling exchange in S2.1-S2.2 has also to be carried out in 
the computation of — b in S3.2, cf. (51). 

We finally remark that the proposed successive line search 
can also be applied and it exhibits a closed-form expression 
as well. However, since the exact line search yields faster 
convergence, we omit the details at this point. 

Simulations. We first compare in Figure the proposed 
algorithm STELA with FLEXA |[T9ll in terms of the error 
criterion e(x^) defined as: 


e(x‘) 4 ||v/(x‘) - [V/(x*) - (53) 

Note that x* is a solution of ( [4^ if and only if e(x*) = 0 
(281. FLEXA is implemented as outlined in ESI; however, the 
selective update scheme (m is not implemented in FLEXA 
because it is also applicable for STELA and it cannot eliminate 
the slow convergence and sensitivity of the decreasing stepsize. 
We also remark that the stepsize rule for FLEXA is 7 ^+^ = 
7 *(l-min(l, 10 “"‘/e(x*))(i 7 *) EH, where d is the decreasing 
rate and 7 ^ = 0.9. The code and the data generating the figure 
can be downloaded online (m. 

Note that the error e(x^) plotted in Figure does not 
necessarily decrease monotonically while the objective func¬ 
tion /(x^) -f always does. This is because STELA and 

FLEXA are descent direction methods. For FLEXA, when 
the decreasing rate is low (d = 10 “^), no improvement 
is observed after 100 iterations. As a matter of fact, the 
stepsize in those iterations is so large that the function value 
is actually dramatically increased, and thus the associated 
iterations are discarded in Figure A similar behavior is 
also observed for d = 10 “^, until the stepsize becomes 
sufficiently small. When the stepsize is quickly decreasing 
{d = 10 “^), although improvement is made in all iterations, 
the asymptotic convergence speed is slow because the stepsize 
is too small to make notable improvement. For this example, 
the choice d = 10 “^ performs well, but the value of a good 
decreasing rate depends on the parameter setup (e.g.. A, b and 


jj) and no general rule performs equally well for all choices 
of parameters. By comparison, the proposed algorithm STELA 
is fast to converge and exhibits stable performance without 
requiring any parameter tuning. 

We also compare in Figure |7] the proposed algorithm STELA 
with other competitive algorithms in literature: FISTA 03, 
ADMM E3, GreedyBCD (13 and SpaRSA (13. We sim¬ 
ulated GreedyBCD of ll4^ because it exhibits guaranteed 
convergence. The dimension of A is 2000 x 4000 (the left 
column of Figure |7]) and 5000 x 10000 (the right column). 
It is generated by the Matlab command randn with each 
row being normalized to unity. The density (the proportion of 
nonzero elements) of the sparse vector Xtme is 0.1 (the upper 
row of Figure |^, 0.2 (the middle row) and 0.4 (the lower 
row). The vector b is generated as b = Axtrue + e where e is 
drawn from an i.i.d. Gaussian distribution with variance 10“^. 
The regularization gain /i is set to /i = 0.1 ||A^b||^, which 
allows Xtrue to be recovered to a high accuracy (431. 

The simulations are carried out under Matlab R2012a on a 
PC equipped with an operating system of Windows 7 64-bit 
Home Premium Edition, an Intel 15-3210 2.50GHz CPU, and a 
8GB RAM. All of the Matlab codes are available online (4T1l . 
The comparison is made in terms of CPU time that is required 
until either a given error bound e(x^) < 10“^ is reached or the 
maximum number of iterations, namely, 2000, is reached. The 
running time consists of both the initialization stage required 
for preprocessing (represented by a fiat curve) and the formal 
stage in which the iterations are carried out. For exarmle, in 
the proposed algorithm STELA, d(A^A) is computecQin the 
initialization stage since it is required in the iterative variable 
update in the formal stage, cf. ( [49| ). The simulation results are 
averaged over 20 instances. 

We observe from Figure [ 7 ] that the proposed algorithm 
STELA converges faster than all competing algorithms. Some 
further observations are in order. 

• The proposed algorithm STELA is not sensitive to the 
density of the true signal Xtme. When the density is increased 
from 0.1 (left column) to 0.2 (middle column) and then to 0.4 
(right column), the CPU time increases negligibly. 

• The proposed algorithm STELA scales relatively well with 
the problem dimension. When the dimension of A is increased 
from 2000 x 4000 (the left column) to 5000 x 10000 (the right 
column), the CPU time is only marginally increased. 

• The initialization stage of ADMM is time consuming 
because of some expensive matrix operations as, e.g., AA^, 
(l + ^ AA^) ^ and A^ (l + AA^) ^ A (c is a given 
positive constant). More details can be found in (121 Sec. 
6.4]. Furthermore, the CPU time of the initialization stage of 
ADMM is increased dramatically when the dimension of A is 
increased from 2000 x 4000 to 5000 x 10000. 

• SpaRSA performs better when the density of Xt^e is 
smaller, e.g., 0.1, than in the case when it is large, e.g., 0.2 
and 0.4. 

• The asymptotic convergence speed of GreedyBCD is 

^The Matlab command is sum (A . ^2 , 1 ), so matrix-matrix multiplication 
between and A is not required. 
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Figure 7. Time versus error of different algorithms for LASSO. In the left 
and right column, the dimension of A is 2000 x 4000 and 5000 x 10000, 
respectively. In the higher, middle and lower column, the density of xtrue is 
0.1, 0.2 and 0.4. 


slow, because only one variable is updated in each iteration. 

To further evaluate the performance of the proposed al¬ 
gorithm STELA, we test it on the benchmarking platform 
developed by the Optimization Group from the Department of 
Mathematics at the Darmstadt University of Technolog}j^ and 
compare it with different algorithms in various setups (data set, 
problem dimension, etc.) for the basis pursuit (BP) problem 
I44l : 

minimize ||x||^ 
subject to Ax = b. 

To adapt STELAfor the BP problem, we use the augmented 
Lagrangian approach 112113: 

= l|x||i + (A5^(Ax-b) + L||Ax-b||2, 

= A* + c*(Ax*+i-b), 

where = min(2c*, 10^) (c° = 10/||A^b||^), x*+^ is 
computed by STELA and this process is repeated until A* 
converges. The numerical results summarized in m show 
that, although STELA must be called multiple times before the 
Lagrange multiplier A converges, the proposed algorithm for 
BP based on STELA is very competitive in terms of running 
time and robust in the sense that it solved all problem instances 
in the test platform database. 

V. Concluding Remarks 

In this paper, we have proposed a novel iterative algorithm 
based on convex approximation. The most critical requirement 

^Project website: http://wwwopt.mathematik.tu-darmstadt.de/spear/ 


on the approximate function is that it is pseudo-convex. On the 
one hand, the relaxation of the assumptions on the approximate 
functions can make the approximate problems much easier to 
solve. We show by a counter-example that the assumption 
on pseudo-convexity is tight in the sense that when it is 
violated, the algorithm may not converge. On the another hand, 
the stepsize based on the exact/successive line search yields 
notable progress in each iteration. Additional structures can be 
exploited to assist with the selection of the stepsize, so that 
the algorithm can be further accelerated. The advantages and 
benefits of the proposed algorithm have been demonstrated 
using prominent applications in communication networks and 
signal processing, and they are also numerically consolidated. 
The proposed algorithm can readily be applied to solve other 
problems as well, such as portfolio optimization Go). 


Appendix A 

Proof of Proposition [T] 

Proof: i) Firstly, suppose y is a stationary point of Q; it 
satisfies the first-order optimality condition: 

V/(y)'^(x - y) > 0, Vx G A. 

Using Assumption (A3), we get 

V/(y;yr(x-y) >0, Vxg A. 

Since /(•;y) is pseudo-convex, the above condition implies 

/(x;y) > /(y;y), Vx g A. 

That is, /(y; y) = minxe;^ /(x; y) and y G 5(y). 

Secondly, suppose y G 5(y). We readily get 

V/(y)^(x- y) = V/(y;y)'^(x- y) > 0, Vx G A, (54) 

where the equality and inequality comes from Assumption 
(A3) and the first-order optimality condition, respectively, so 
y is a stationary point of Q- 

ii) From the definition of ®x, it is either 

/(By;y) =/(y;y), (55a) 


or 


/(%;y)</(y;y), 


(55b) 


If (55a) is true, then y e S{y) and, as we have just shown, it is 
a stationary point of Q- So only ( [55b| ) can be true. We know 
from the pseudo-convexity of /(x; y) in x (cf. Assumption 
(Al)) and ( |55b| ) that By 7^ y and 

v/(y; y)^(By - y) = V/(y)^(By - y) < 0, (56) 


where the equality comes from Assumption (A3). 


Appendix B 
Proof of Theorem [2] 

Proof: Since Bx^ is the optimal point of ([^, it satisfies 
the first-order optimality condition: 

V/(Bx^ x^)^(x - Bx^) > 0, Vx G A. (57) 

If ( |55a| ) is true, then x^ G 5(x^) and it is a stationary point 
of ^ according to Proposition [^(i). Besides, it follows from 
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(54) (with X = Bx^ and y = x^) that V/(x^)^( 
Note that equality is actually achieved, i.e., 


^x^-x^) > 0. 


V/(x^ 


t\Tr 


- x^) = 0 


because otherwise Bx^ — x^ would be an ascent direction 
of /(x;x^) at X = x^ and the definition of Bx^ would be 
contradicted. Then from the definition of the successive line 
search, we can readily infer that 


Appendix C 
Proof of Theorem [3] 

Proof: We first need to show that Propositionstill holds, 
(i) We prove y is a stationary point of 0 if and only if 
yfe e argminxfceA’fc f{xk,y-k) for all k. 

Suppose y is a stationary point of it satisfies the first- 
order optimality condition: 

v/(y)^(x - y) = Ef=iVfe/(y)^(xfe - y^) > 0,Vx G A”, 


/(x*+i) < /(x*). (58) 

It is easy to see ( [58] ) holds for the exact line search as well. 

If ( |55b| ) is true, x^ is not a stationary point and Bx^ — x^ 
is a strict descent direction of /(x) at x = x^ according to 
Proposition (ii): /(x) is strictly decreased compared with 
/(x^) if X is updated at x^ along the direction Bx^ — x^. 
From the definition of the successive line search, there always 
exists a 7^ such that 0 < 7^ < 1 and 

/(x*+i) = /(x* + 7 *(Bx* - X*)) < /(x*). (59) 


and it is equivalent to 

Vjk/(y)^(xfe - yfe) > 0,Vxfe G Xk- 

Since /(x) is pseudo-convex in Xfe, the above condition 
implies f{yk,y-k) = f{xk,y-k) for all k = 

Suppose yfe G argminx^eA'fc/(xfe,y_fe) for all k = 
The first-order optimality conditions yields 

Vfe/(y)^(xfe - yfe) > 0,Vxfe G Xk- 

Adding the above inequality for all /c = 1,..., AT yields 


This strict decreasing property also holds for the exact line 
search because it is the stepsize that yields the largest decrease, 
which is always larger than or equal to that of the successive 
line search. 


We know from (58) and (59) that {/(x^)} is a monoton- 
ically decreasing sequence and it thus converges. Besides, 
for any two (possibly different) convergent subsequences 
{x‘}t 


iteTi 


and , the following holds: 


lim /(x^) = 

t^oo 


lim 

Ti —^cxD 


/(x*) 


lim 

T 2 3 ^—yoo 


/(x‘). 


V/(y)^(x-y) >0 ,VxGT-. 

Therefore, y is a stationary point of 

(ii) We prove that if y is not a stationary point of (j^, then 
V/(y)^(By - y) < 0. 

It follows from the optimality of B/^x that 

/(®fey,y-fe) < f{yk,y-k), 


and 


Vfe/(Bfey, y_fe)^(xfe - Bfey) > 0, Vxfe e -Tfe. (61) 


Since /(x) is a continuous function, we infer from the 
preceding equation that 


/ ( lim xM = / ( lim xM . (60) 

\ri3t^oo ) \r23t^oo J 

Now consider any convergent subsequence {x^}ter with 
limit point y, i.e., x^ = y. To show that y is 

a stationary point, we first assume the contrary: y is not 
a stationary point. Since /(x;x^) is continuous in both x 
and x^ by Assumption (A2) and {Bx is bounded by 
Assumption (A5), it follows from 1251 Th. 1] that there exists 
a sequence with Ts ^ T such that it converges 

and Iim7;3t^cx3 G 5(y). Since both /(x) and V/(x) are 
continuous, applying (251 Th. 1] again implies there is a Ts' 
such that Ts' ^ Ts{^ T) and converges to y' 

defined as: 

y' = y + p(®y-y), 


where p is the stepsize when either the exact or successive 
line search is applied to /(y) along the direction By — y. 
Since y is not a stationary point, it follows from ( |59| ) that 
f{y') < /(y)’ but this would contradict (60). Therefore y is 
a stationary point, and the proof is completed. ■ 


Firstly, there must exist an index j such that 

/(%y.y-i) < (62) 

otherwise y would be a stationary point of (|^. Since /(x) is 
pseudo-convex in x/^ fork = 1 ,..., Ff, it follows from ( [62] ) 
that 

Vi/(y)^(®iy - yj) < 0. (63) 

Secondly, for any index k such that /(Bfey,y_fe) = 
f{yk,y-k), yk minimizes /(xfe,y_fe) over Xfe G Xk and 
Vfe/(yfe,y-fe)^(xfe - yfe) >0 for any Xfe e T”. Setting 
Xfe = ®fey yields 

Vfe/(yfe, y_fe)^(Bfey - yfe) > 0. (64) 

Similarly, setting Xfe = yfe in ( |M] l yields 

Vfe/(®fey, y-fe)^(yfe - ®fey) > 0. (65) 

Adding ( |^ and (j^, we can infer that (Vfe/(y) — 
Vfe/(Bfey,y_fe))^(yfe - Bfey) > 0. Therefore, we can rewrite 
( (65l l as follows 

0 < Vfe/(Bfey,y_fe)'^(yfe - Bfey) 

= (Vfe/(Bfey,y_fe) - Vfe/(y) -f Vfe/(y))^(yfe - Bfey), 
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and thus 

Vfc/(y)^(Bfcy-yfe) < 

-(VA:/(Bfey,y-fc)-Vjk/(y))^(Bfey - yk) < 0. (66) 

Adding ( [63] ) and ( [66] ) over all k = 1,..., K yields 

V/(y)^(By - y) = Ef=iVfe/(y)^(Bfcy - y^) < 0. 

That is, By — y is a descent direction of f(x) at the point y. 

The proof of Theorem can then be used verbatim to 
prove the convergence of the algorithm with the approximate 
problem ( jST] ) and the exact/successive line search. ■ 
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